Additional documentation (such as version history) can be found in the <a href="file://ReadMe.txt">ReadMe.txt</a> file.
<a name="INTRODUC"><H1>Introduction</H1>
<a name="FIRSTUQU"><H2>First-Time User Questions</H2>
<hl>What does TextHarvest do?</hl>
It reads files and copies the information you want, altering it in various
ways you specify.
<hl>What kinds of files can it process?</hl>
Text (Windows, DOS, Unix, Macintosh), fixed-record-length, and character-
terminated.
<hl>How much does TextHarvest cost?</hl>
For most people: nothing. Certain exceptionally powerful capabilities
require that you purchase a special license, but the average user will
not need these.
<hl>Can I give copies of TextHarvest to other people?</hl>
Yes.
<hl>Can I sell copies of TextHarvest to other people?</hl>
No. You can charge a small distribution fee, though.
<hl>I'm a programmer, so why would I need TextHarvest?</hl>
Many operations that would take 100 lines of code in a traditional
programming language can be performed with a single script command.
<a name="OVERVIEW"><H2>General Overview</H2>
In its simplest form, TextHarvest is a utility that copies a text file. As it does so, it can:
<HL>
• Retain lines that contain specific text
• Skip lines that contain specific text
</HL>
Thus, you can use TextHarvest to filter a text file, preserving only those lines that interest you.
You can also use powerful <a href="#SCRIPTNG">scripts</a> to:
<hl>
• Process files other than plain text
• Modify the data sent to the output file
• Perform further filtering and analysis
</hl>
Here are some typical scripted operations:
<hl>
• Change one word to another one
• Change uppercase to lowercase
• Rearrange columns of text
• Look up values in a table
• Convert data to CSV (Comma Separated Value) format
</hl>
But for now, let's start with the basics ...
<a name="SIMPEXAM"><H2>A Simple Example</H2>
TextHarvest comes with a demonstration file, named <a href="file://ThingsToDo.txt">ThingsToDo.txt</a> (click to view), which contains a simple "To Do" list. You can view this file by entering the name in the "Input File" box then clicking the corresponding View button.
The first column contains a category, such as "Car", "Home" or "Work", while the second column describes the task to be done.
Let us say you only wanted to see the lines that contained the word "Work". Here is the way to do this:
<HL>
• Specify the input file name (ThingsToDo.txt)
• Specify an output file name (Output.txt)
• Make sure Autoview is checked and Append is <i>not</i> checked
• Make sure the "Script file" input box contains the word "None" (no quotes)
• Put the word "Work" in the "/Keep list" like this: /Work
• Click the Start button (shortcut key: F9)
</HL>
TextHarvest will then read <a href="file://ThingsToDo.txt">ThingsToDo.txt</a> and copy only those lines that contain the word "Work" (or variations such as "WORK", or "work"). Then, because you checked "Autoview", the output file (Output.txt) will be displayed.
Note: The reason we put a slash ("/") character in front of the word "Work" is explained later, in the section "<a href="#SPECALOW">Specifying a List of Words</a>".
<a name="ANOTEXAM"><H2>Another Example</H2>
Now let us suppose that you want to do the opposite of what you did in the previous example: you want to see every line <i>except</i> those that contain the word "Work". Remove the word "Work" from the "/Keep list" input box and put it in the "/Delete list" box like this:
<HL> /Work</HL>
When you click the Start button, TextHarvest will copy the file (<a href="file://ThingsToDo.txt">ThingsToDo.txt</a>) to the output file but will remove any lines that contain the word "Work".
<a name="THBASICS"><H1>TextHarvest Basics</H1>
<a name="WORDLSTS"><H2>Word Lists</H2>
<a name="SPECALOW"><H3>Specifying a List of Words</H3>
Once again using <a href="file://ThingsToDo.txt">ThingsToDo.txt</a>, let us copy only those lines that contain the word "Home", or "Work", or both.
Make sure that the "/Delete list" input box is empty, then enter the following in the "/Keep list" input box:
<HL> /home/work</HL>
When you click the Start button, the file will be copied, but only those lines that contain "Home" or "Work" (or variations such as "HOME", "work" and so on) will be included.
<a name="COMBKAND"><H3>Combining Keep and Delete</H3>
You can specify both Keep and Delete lists. For example, let us say you used the following criteria:
<HL>
/Keep list: /work/home
/Delete list: /inventory
</HL>
This would copy any lines with the words "work" or "home", but which do <i>not</i> contain the word "inventory".
When Keep and Delete lists are both specified, a line is first checked to see if it passes the "Keep" test. If so, it is then compared to the "Delete" list. If a match is found, the line is <i>not</i> copied.
By default, TextHarvest will ignore text case when looking at the "/Keep list" and the "/Delete list". You can override this behaviour, though, using the "/Controls" input box. Here are the various settings:
<HL>
/KI = Ignore case on Keep (default)
/KM = Match case on Keep
/DI = Ignore case on Delete (default)
/DM = Match case on Delete
</HL>
Try using the sample input file <a href="file://ThingsToDo.txt">ThingsToDo.txt</a> to test this out:
<HL>
• Make sure the "Script file" input box contains the word "None" (no quotes)
• Set your "/Keep list" to "/CAR/work"
• Make sure your "/Delete list" is empty
• Set your "/Controls" input box to "/KM"
• Click the Start button
</HL>
The output will contain references to "CAR", but will ignore the lines that start with "WORK" because "WORK" (which is in uppercase) does not match "work" (which is in lowercase).
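The Keep-then-Delete order and the case controls described above can be sketched in Python. This is only an illustration of the documented behaviour, not TextHarvest's own code: the function names are hypothetical, the word lists are passed in already split, and an empty Keep list is assumed to keep every line.

```python
def contains_any(line, words, match_case=False):
    """Return True if the line contains at least one of the words."""
    if not match_case:
        # Default behaviour (/KI, /DI): ignore case.
        line = line.lower()
        words = [word.lower() for word in words]
    return any(word in line for word in words)


def should_copy(line, keep_words, delete_words,
                keep_match_case=False, delete_match_case=False):
    """Apply the Keep test first, then the Delete test."""
    # Assumption: an empty Keep list keeps every line.
    if keep_words and not contains_any(line, keep_words, keep_match_case):
        return False
    # A line that passed the Keep test is dropped if it matches the Delete list.
    if delete_words and contains_any(line, delete_words, delete_match_case):
        return False
    return True
```

With made-up input lines, should_copy("WORK  Check inventory", ["work", "home"], ["inventory"]) is False because the Delete list matches, while should_copy("WORK  Call supplier", ["CAR", "work"], [], keep_match_case=True) is False because "WORK" does not match "work" when case is matched.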
<a name="NULLLINE"><H3>Null Lines</H3>
By default, TextHarvest ignores all null (zero-length) lines in the input file. However, you can set the "/Controls" input box to deal with this. Here are the settings:
<HL>
/NI = Ignore null lines (default)
/NK = Keep null lines
/NS = Keep null lines, but never output more than two in a row
</HL>
Try using the sample input file <a href="file://ThingsToDo.txt">ThingsToDo.txt</a> to test this out...
<HL>
• Make sure the "Script file" input box contains the word "None" (no quotes)
• Clear the "/Keep list" input box
• Set "/Delete list" to "/car/work"
• Set "/Controls" to "/NK"
• Click the Start button
</HL>
The output will <i>not</i> contain any lines containing "car" or "work", but it <i>will</i> contain any null lines found in the input file.
Try the experiment again, first with "/NS" and then with "/NI". (Since "/NI" is the default, you could also simply leave the "/Controls" input box blank.)
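The three null-line settings can be pictured with a short Python sketch. It is a hedged illustration only: the function name and the mode strings are hypothetical, and it assumes that a "run" of null lines is counted on the input lines alone, independently of the Keep and Delete lists.

```python
def filter_null_lines(lines, mode="NI"):
    """Apply /NI (ignore), /NK (keep) or /NS (keep at most two in a row)."""
    result = []
    null_run = 0  # number of consecutive null lines seen so far
    for line in lines:
        if line != "":
            null_run = 0
            result.append(line)
            continue
        null_run += 1
        if mode == "NK" or (mode == "NS" and null_run <= 2):
            result.append(line)
        # mode "NI": the null line is simply dropped
    return result
```

For an input of "a", three null lines, "b", one null line and "c", the /NI mode leaves only the three text lines, /NK returns the input unchanged, and /NS drops the third null line of the first run.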
<a name="FNAMEANO"><H3>File Name Annotation</H3>
If you are processing multiple files using <a href="#WILDCRDS">wildcards</a>, you may wish to know which output lines came from which files. TextHarvest can annotate the output such that the file name precedes lines extracted from a particular file:
<HL>
/FN = No, do not output the file name (default)
/FY = Yes, output the file name
/FS = Yes, output the file name, and put separator lines above and below
</HL>
The separator line (control /FS) makes it easier to spot the file names in a long output file.
Only the names of files that actually generate output lines are included. If a file does not generate any output lines, its name is not mentioned.
File name annotation lets you use TextHarvest as a "Find Text" utility. For example, if you wanted to search a folder for the word "inventory", you could do this:
<HL>
• Set the "Input file" box to the wildcard "Things*.txt" (without the quotes)
• Make sure the "Script file" input box contains the word "None"
• Set the "/Keep list" input box to "/inventory"
• Make sure your "/Delete list" is empty
• Set the "/Controls" input box to "/FS" or "/FY"
• Click the Start button
</HL>
The example given above would search all files matching the wildcard pattern Things*.txt for the word "inventory".
<a name="REGEXP00"><H3>Regular Expressions</H3>
By default, TextHarvest will search for the precise text fragments you specify in the /Keep and /Delete lists. However, you can enable "regular expressions", which let you match patterns rather than specific sequences of characters:
<HL>
/KR = Enable regular expressions for the /Keep list
/DR = Enable regular expressions for the /Delete list
</HL>
Consider the following /Keep list:
<HL> /D.g/C[aou]t</HL>
With /KR specified in the "/Controls" input box, this would match any line that contained "Dog", "Cat", "Cot" or "Cut". It would also match lines containing "Dig" and "D3g", so when you are using regular expressions you must make sure that the pattern describes precisely what you want.
If you have never used regular expressions before, you may find them a bit confusing at first, but with a bit of practice you will come to appreciate just how much power they put at your fingertips.
Please see "<a href="#REGEXPRS">Regular Expression Syntax</a>" for additional examples of regular expressions.
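If you would like to experiment with patterns outside TextHarvest, the following Python sketch mimics a /Keep list used with /KR. It relies on Python's own re module, whose syntax may differ in places from TextHarvest's regular expressions; the function name is hypothetical, and the default of ignoring case is an assumption carried over from the /KI setting.

```python
import re


def keep_by_regex(line, patterns, match_case=False):
    """Return True if any pattern is found somewhere in the line."""
    flags = 0 if match_case else re.IGNORECASE
    return any(re.search(pattern, line, flags) for pattern in patterns)


# The two patterns from the /Keep list "/D.g/C[aou]t".
patterns = ["D.g", "C[aou]t"]
for text in ["My Dog barks", "Dig a hole", "Part D3g", "A Cit line"]:
    print(text, "->", keep_by_regex(text, patterns))
```

The first three sample lines match (including the probably unwanted "Dig" and "D3g"), while the last one does not.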
<a name="OTHRCONT"><H2>Other Controls</H2>
<b><HL>Autoview</HL></b>, if checked, displays the output file after processing (if there is anything to display). If it is not checked, you have to click the View button to see the output.
<b><HL>Append</HL></b>, if checked, places the output at the end of the specified output file. If it is not checked, the original copy of the output file (if it exists) is renamed with a .BAK extension and a new version is created.
<a name="ADVANCED"><H1>Advanced Techniques</H1>
<a name="SHORTCTK"><H2>Shortcut Keys</H2>
In addition to the standard Windows shortcut key conventions (i.e. pressing Alt plus a letter that is underlined), the following shortcut keys are defined:
The Esc (Escape) key will close most windows opened by TextHarvest.
<a name="WILDCRDS"><H2>Wildcards</H2>
You can process more than one input file at a time by using wildcards. For example, if you set the input file box to *.txt then all files with a .txt extension will be processed. Here are some more examples:
Note that the asterisk (*) is interpreted differently in <a href="#REGEXP00">regular expressions</a> than it is in file name wildcards.
You cannot specify wildcards for the output file. All output goes to a single output file.
<a name="MULTWDCS"><H3>Multiple Wildcards</H3>
You can specify multiple wildcards by using semicolons, as in this example:
<HL> *.txt;*.me</HL>
This would process input files with the .txt extension (example: xyz.txt) and the .me extension (example: read.me).
There is no limit to the number of wildcards you specify, but bear in mind that TextHarvest lets you process the same file more than once. Consider this example:
<HL> *.txt;my*.txt</HL>
This would process all files with a .txt extension, then all files with a .txt extension where the file name starts with "my". Thus, a file named "myfile.txt" would be processed twice.
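The double-processing effect can be reproduced with a small Python sketch. It is only an approximation: it uses Python's fnmatch module rather than TextHarvest's own wildcard matching, the function name is hypothetical, and the order of files within one wildcard is assumed to follow the order of the list supplied.

```python
from fnmatch import fnmatchcase


def expand_wildcards(spec, file_names):
    """Return the files that would be processed, in order, with repeats."""
    processed = []
    for pattern in spec.split(";"):
        for name in file_names:
            # Lowercase both sides so that matching ignores case,
            # as Windows file names do.
            if fnmatchcase(name.lower(), pattern.lower()):
                processed.append(name)
    return processed


names = ["myfile.txt", "notes.txt", "read.me"]
print(expand_wildcards("*.txt;my*.txt", names))
```

In the printed list, "myfile.txt" appears twice: once for *.txt and once for my*.txt.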
You cannot specify multiple file names for the output file. All output goes to a single output file.
<a name="CLIPBORD"><H2>Processing the Windows Clipboard</H2>
TextHarvest can read from and write to the Windows text clipboard as if it were a regular text file. To read from the clipboard, enter CLIPBOARD in the "Input File" box. To write to the clipboard, enter CLIPBOARD in the "Output File" box.
It is possible to do both at once. Of course, after processing, the original contents of the clipboard will have been overwritten.
Tip: Most Windows programs let you copy selected text with Ctrl-C and paste with Ctrl-V.
Note: You can use the sample file <a href="file://ThingsToDo.txt">ThingsToDo.txt</a> to try out the examples given below. The examples should be entered in your /Keep list. Make sure that the "/Delete" and "/Controls" input boxes are empty, and that the "Script file" input box is set to "None".
The lists of words (see "<a href="#WORDLSTS">Word Lists</a>") you enter in the "/Keep list" and "/Delete list" input boxes are typically a sequence of alternatives. For example, if your /Keep list is "/Cat/Dog/Cow" it means you want to keep lines that contain "Cat" or "Dog" or "Cow". This is called an "OR-list".
However, sometimes you want to keep lines that contain all of the words you listed. That is to say, if even one of the words is missing, you don't want to keep the line. For this you need an "AND-list".
TextHarvest's AND function is represented by two ampersands. Here is an example of ANDing...
<HL> /Cat&&/Dog&&/Cow</HL>
This will match any line that contains all three (Cat, Dog and Cow).
You can combine ANDing and ORing, as in this example:
<HL> /Cat/Dog/Cow&&/Moose</HL>
This will match any line that contains any one of the first three items (Cat or Dog or Cow) AND also contains the word Moose.
Now consider this example:
<HL> /Cat/Dog/Cow&&/Moose/Antelope</HL>
This will match any line that contains one of the first three items (Cat or Dog or Cow) AND also contains one of the next two items (Moose or Antelope).
If <i>any</i> of the AND conditions is not met, the line does not match. For example, consider this list:
<HL> /North/South&&/Up/Down&&/Back/Forth</HL>
A line that contains North, Up and Back would match. A line that contains South, Down and Back would match. But a line that is missing both North and South would <i>not</i> match.
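The AND/OR logic described above can be summarised in a few lines of Python. This is a sketch of the documented behaviour under stated assumptions, not TextHarvest's implementation: the function name is hypothetical, case is ignored (the default), and each group between "&&" markers is treated as an ordinary OR-list whose first character is the delimiter.

```python
def matches_and_list(line, spec):
    """Return True if the line satisfies every OR-group in the AND-list."""
    line = line.lower()
    for group in spec.split("&&"):
        if not group:
            continue  # ignore an empty group
        delimiter = group[0]
        words = [w.lower() for w in group[1:].split(delimiter) if w]
        # The line must contain at least one word from this group.
        if not any(word in line for word in words):
            return False
    return True


spec = "/North/South&&/Up/Down&&/Back/Forth"
for text in ["North Up Back", "South Down Back", "Up Back"]:
    print(text, "->", matches_and_list(text, spec))
```

The first two sample lines match; the third does not, because it contains neither North nor South.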
<a name="CMDLINEPS"><H2>Command Line Parameters</H2>
To call TextHarvest from the command line (e.g. from a <a href="#BATCHFIL">batch file</a> or in a Windows shortcut), the following format is used:
You can also specify the /Keep, /Delete and /Controls lists:
<HL>
/X"/keep/list"
/Y"/delete/list"
/Z"/control/list"
</HL>
To specify a script file, use /S as in this example:
<hl>
/S"ScriptSample01.txt"
</hl>
If you are not using a script, you should specify /S"None" to override whatever value TextHarvest had previously saved for that input box.
For a general overview of command line parameters, start up TextHarvest as follows:
<HL> TextHarvest /?</HL>
This displays a window which summarizes the command-line options. The window is also displayed if your command line contains an option that TextHarvest does not recognize.
When calling TextHarvest from a batch file, you must use the Windows START command with the /WAIT option to allow TextHarvest to complete processing before moving to the next line in the batch file.
If the batch file is running unattended, you should also feed TextHarvest the following parameters:
If a serious error occurs during processing, TextHarvest creates a file named TextHarvest-Error.txt in its program directory. The file is plain text and contains information about the error. You can view the Error Reporting File using the "Support Files" input box of the Parsing Parameters window; it will be listed in the drop-down list.
If no error occurs, the file is <i>not</i> present after processing is complete.
If you are using TextHarvest in a batch file, you can check to see if processing worked by using the IF EXIST test, as in this example:
Note that the /CA parameter suppresses pop-up error messages, so if you use it in your batch file, it is up to your batch file to watch for the error file and then determine what to do if an error (such as "File not found") occurs.
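If you launch TextHarvest from a script written in another language rather than from a batch file, the same check can be made there. The Python sketch below only looks for the error file after TextHarvest has finished; the function name is hypothetical, and the program directory must be supplied by you.

```python
from pathlib import Path


def read_error_report(program_dir):
    """Return the text of the Error Reporting File, or None if it is absent."""
    error_file = Path(program_dir) / "TextHarvest-Error.txt"
    if error_file.exists():
        # The file is plain text; undecodable bytes are replaced, not fatal.
        return error_file.read_text(errors="replace")
    return None
```

A return value of None means that no error file was found, which (as explained above) indicates that processing completed without a serious error.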
<a name="MSGLOGFL"><H3>The Log File</H3>
In addition to the <a href="#ERRORFIL">Error Reporting File</a>, TextHarvest also creates a log file (named TextHarvest-Log.txt). TextHarvest uses the log file to record the date and time when processing started and ended. It also uses the log file to report anything that is slightly unusual but not a serious problem.
You can view the Log File using the "Support Files" input box of the Parsing Parameters window; it will be listed in the drop-down list.
<a name="USAGNOTS"><H1>Usage Notes</H1>
<a name="MATCHPRB"><H2>Matching Problems</H2>
<a name="FALSMATC"><H3>False Matches</H3>
Sometimes TextHarvest matches on strings of characters that you do not want matched. For example, if you set your /Keep list to /home/car while copying the sample file <a href="file://ThingsToDo.txt">ThingsToDo.txt</a> you will find that an additional line is included:
<HL> WORK Buy toner cartridge for laser printer</HL>
This was included because the characters "car" appear in the word "cartridge". You can get around this by explicitly indicating the space after "car":
<HL> /home/car /</HL>
An alternative solution in this particular case would be to set the /Keep list to "/HOME/CAR" and the /Controls setting to "/KM" (Keep: match case).
<a name="FINDSLAS"><H3>Finding Slashes</H3>
You will normally separate the words in your /Keep and /Delete lists with the slash ("/") character (e.g. "/home/work"). But what if you are looking for a slash? All you need to do is begin your word list with a different character, such as the "backslash" character ("\").
You can try processing the sample input file <a href="file://ThingsToDo.txt">ThingsToDo.txt</a> with the following "/Keep list" to see that this works as it should:
<HL> \home\work</HL>
In other words, the first character in the list becomes the delimiter which separates the words.
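The delimiter rule is easy to express in code. The following Python sketch shows one way such a list could be split; the function name is hypothetical, and dropping empty items (such as the one after a trailing delimiter) is an assumption.

```python
def parse_word_list(spec):
    """Split a word list using its first character as the delimiter."""
    if not spec:
        return []
    delimiter = spec[0]
    # Empty items (e.g. after a trailing delimiter) are dropped.
    return [word for word in spec[1:].split(delimiter) if word]


print(parse_word_list("/home/work"))    # slash as delimiter
print(parse_word_list("\\home\\work"))  # backslash as delimiter
print(parse_word_list("/home/car /"))   # trailing delimiter keeps the space after "car"
```

The first two calls give the same two words, "home" and "work". In the third, the closing slash preserves the space, so the second word is "car " rather than "car".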
<a name="FILEFMTS"><H2>File Formats</H2>
If you do not use scripts, TextHarvest can read either Windows-style (CRLF-terminated) text files or Unix-style (LF-terminated) text files, and output is always a Windows-style (CRLF-terminated) text file.
If you do use scripts, TextHarvest can read all standard text files (including the Macintosh variety), fixed-record-length files, and character-terminated records, while output can be whatever you want it to be.
Note: In the following examples, we assume that case sensitivity has been turned on, using the /KM or /DM setting in the "/Controls" input box.
Here are some examples of matches:
<HL>
C.t Matches Cat, Cot, Cut, Cxt, C3t etc.
C[aou]t Matches Cat, Cot, Cut only
B..d Matches Bird, Bred, Bead etc.
^Dog Matches Dog only if it is at the beginning of a line
Moose$ Matches Moose only if it is at the end of a line
Pa*d Matches Pd, Pad, Paad, Paaad etc.
</HL>
<a name="RXAST"><H3>Using the Asterisk</H3>
The last example given above uses the * character to indicate zero, one or more occurrences of a particular character (in this case, the letter "a"). Unlike the * wildcard character used in file names, it does not stand for "any characters"; it applies only to the character specification that precedes it. That is why "Pa*d" would not match "Parsed": the asterisk means "match zero or more of the preceding character specification".
If you actually want to search for "Pa" followed by one or more letters and then "d", the correct syntax is:
<HL> Pa[a-z][a-z]*d</HL>
This means that we want to match "Pa", then a letter in the range from "a" to "z", then some number (including zero) of characters in the "a" to "z" range, and finally the letter "d". The character string "Parsed" would meet these criteria, as would "Paid" and "Packed". "Pad" would not, because at least one letter is required between "Pa" and "d".
<HL>
[0-9][0-9]* Matches numbers such as 0, 1, 01, 10, 25, 0990, 9999 etc.
-[0-9][0-9]* Matches negative numbers such as -0, -1, -19, -12345 etc.
</HL>
In the last example, [0-9] is specified twice to ensure that at least one digit is found. Bear in mind that the * character means "zero or more occurrences". If you had specified "-[0-9]*" you would get a match within the sequence "Hello - there", since the "-" character is indeed found, followed by zero occurrences of the digits 0 through 9.
You can create fairly complex patterns using regular expressions. Consider this example:
<HL> \$[0-9][0-9]*\.[0-9][0-9]</HL>
This would match dollar amounts with two decimal places, such as $0.00, $03.23, $3.14, $9.99, $1234.56 and so on.
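The asterisk examples in this section can be checked with Python's re module, as in the sketch below. Python's regular expression syntax is not guaranteed to be identical to TextHarvest's, so treat this as a way to test your understanding rather than as a definition; the helper name is hypothetical.

```python
import re


def found(pattern, text):
    """Return True if the pattern occurs anywhere in the text."""
    return re.search(pattern, text) is not None


# "Pa*d" allows only the letter "a" to repeat, so "Parsed" does not match.
print(found("Pa*d", "Pd"), found("Pa*d", "Parsed"))

# One or more letters between "Pa" and "d".
print(found("Pa[a-z][a-z]*d", "Parsed"), found("Pa[a-z][a-z]*d", "Pad"))

# "-[0-9]*" matches a lone "-"; repeating [0-9] demands at least one digit.
print(found("-[0-9]*", "Hello - there"), found("-[0-9][0-9]*", "Hello - there"))

# Dollar amounts with exactly two decimal places.
print(re.findall(r"\$[0-9][0-9]*\.[0-9][0-9]", "Total: $3.14 and $1234.56"))
```

Each of the first three lines prints True followed by False, and the last line lists the two dollar amounts found in the sample text.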
<a name="SCRIPTNG"><H1>Scripting</H1>
Parse-O-Matic Scripting lets you modify the results generated by TextHarvest.
Scripting can examine the text lines that are retained after TextHarvest's /Keep and /Delete settings are taken into account. You could, for example:
<hl>
• Replace one string of text with another one
• Convert some of the line to uppercase
• Eliminate certain lines on the basis of multiple criteria
• Rearrange the order of data items in a line
• Add up numbers and include totals at the end of the output
</hl>
All this, and much, much more, is possible with Parse-O-Matic Scripting.
When using a script you will generally leave the /Keep and /Delete input boxes empty, since the script can do this kind of selection. The /Controls input box can be set to /NK to keep null lines or /NI (default) to ignore null lines.
<a name="SCSEXAMP"><H2>A Simple Example</H2>
Here is a very simple example, using the sample <a href="file://ThingsToDo.txt">ThingsToDo.txt</a> file. Let us say you wanted to convert the "category" (CAT, CAR, HOME, WORK, LEISURE) to lowercase. To do this, you would use a text editor program to write a script file (let's call it <a href="file://ScrExperiment.txt">ScrExperiment.txt</a>) that looks like this:
<hl>
Category = $OutData[1 9]
Description = $OutData[10 999]
Category = ChangeCase Category 'Lowercase'
OutEnd Category Description
</hl>
The first two lines extract the two parts of the output data from the variable named $OutData, which contains the line of text from TextHarvest. The third line converts the category to lowercase, while the final line sends the modified line to the output file. (Whenever you run TextHarvest's results through a script, it is up to the script to actually send the lines to the output file.)
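For readers who know a traditional programming language, here is roughly what the script above does, written in Python. This is an interpretation rather than a specification of Parse-O-Matic: it assumes that $OutData[1 9] means characters 1 through 9 (counting from 1), and that OutEnd writes its values one after the other as a single output line. The function name is hypothetical.

```python
def transform(line):
    """Lowercase the category (columns 1-9) and keep the description as is."""
    category = line[0:9]       # columns 1 through 9
    description = line[9:999]  # columns 10 through 999
    category = category.lower()
    return category + description


# A made-up input line with the category padded to nine columns.
sample = "WORK".ljust(9) + "Buy toner cartridge for laser printer"
print(transform(sample))
```

The printed line begins with "work" instead of "WORK"; the description is unchanged.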
To run this script, you would enter its name (we called it <a href="file://ScrExperiment.txt">ScrExperiment.txt</a>) in the "Script File" input box of the Parsing Parameters window, then click the Start button.
If you do <i>not</i> want to run a script (i.e. you simply want to use TextHarvest as a basic filter), enter "None" (without the quotes) in the "Script File" input box.
<a name="SCRUSERM"><H2>Scripting User Manual</H2>
A complete user manual for Parse-O-Matic Scripting is included with TextHarvest.
Click <a href="exec:PommelScan">here</a> to access the "Parse-O-Matic Scripts" user manual.
<a name="SCSAMPLE"><H2>Sample Scripts</H2>
Here is a list of the sample scripts included with TextHarvest:
<hl>
Adv = Uses Advanced Scripting commands (see the Scripting user manual).
</hl>
It is best to study these scripts in the order they are listed above. To view a script, click on the button with the folder icon next to the "Script File" input box. You can then select a script and view it by clicking the View button.
To try out the sample script <a href="file://ScriptSample01.txt">ScriptSample01.txt</a>:
<hl>
• Set your "Input File" box to ThingsToDo.txt
• Set the "Output File" box to an appropriate file name (e.g. Output.txt)
• Make sure Autoview is checked and Append is <i>not</i> checked
• Clear your "/Keep list", "/Delete list" and "/Controls" input boxes
• Set the "Script File" input box to ScriptSample01.txt
• Click the Start button
</hl>
Once the output file is displayed, you may find it helpful to also view the input file, so you can understand how the output data was transformed.
If you should need to uninstall TextHarvest, start up the Windows Control Panel, then click on Add/Remove Programs. Find TextHarvest on the list, and proceed with removal.
<a name="CUSTOMCO"><H1>Custom Conversion</H1>
TextHarvest is handy and simple to use, but it has its limitations. That is a perennial problem with utilities: there always seems to be one feature missing, and it is the one that you urgently need!
We invite you to visit <a href="http://www.parse-o-matic.com">our web site</a> if you need a custom conversion application. Our company has been doing data conversion since 1985.
<a name="LEGALESE"><h1>Legal Notices</h1>
TextHarvest<fs x="1.5"></fs> and Parse-O-Matic<fs x="1.5"></fs> are trademarks of Pinnacle Software.<fs x="0.75">
The entire product (comprising software, documentation and supporting provisions) is presented as-is; we make no claim about (and disavow liability for) its suitability, accuracy, reliability, performance etc. If you should encounter a problem with the product, please <a href="mailto:info@parse-o-matic.com">write to us</a> to find out if a solution is available.